Preparing the Data
Learn how to prepare your data for anomaly detection.
We'll cover the following
Data in server logs#
Application servers such as Nginx, Apache, and IIS write very useful information to access logs. The data in these logs can be instrumental in identifying anomalies.
We will analyze logs of a web application, so the data we are most interested in is the timestamp and the status code of every response from the server. To illustrate the type of insight, we can draw from this data:
- A sudden increase in the
500
status code: You may have a problem with the server. Did you push a new version? Is there an external service you are using that started failing in unexpected ways? - A sudden increase in the
400
status code: You may have a problem with the client. Did you change some validation logic and forgot to update the client? Did you make a change and forgot to handle backward compatibility? - A sudden increase in the
404
status code: You may have an SEO problem. Did you move some pages and forgot to set up redirects? Is there some script kiddy running a scan on your site? A sudden increase in the200
status code: You either have some significant legit traffic coming in or are under a DOS attack. Either way, you probably want to check where it is coming from.
Preparing the data#
Parsing and processing logs are outside the scope of this course but let’s assume we did that. By the end, we’ll have a table that looks like this:
The table stores the number of entries for each status code at a given period. For example, our table stores how many responses returned each status code every minute:
Note that the table has a row for every minute, even if the status code was never returned in that minute. Given a table of statuses, it is very tempting to do something like this:
This is a common mistake, that can leave you with gaps in the data. Zero is a value, and it holds a significant meaning. A better approach is to create an “axis” and join to it:
First, we should generate an axis using a cartesian join between the status codes we want to track and the times we want to monitor. To generate the axis, we can use two nice features of PostgreSQL:
generate_series
: A function that generates a range of values.VALUES
list: A special clause that can generate “constant tables,” as the documentation calls it. You might be familiar with theVALUES
clause fromINSERT
statements. In the old days, to generate data, we had to use a bunch ofSELECT
…UNION ALL
… UsingVALUES
is much nicer.
After generating the axis, we left to join the actual data to get a complete series for each status code. The resulting data has no gaps and is ready for analysis.